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Abstract 

We formulate genome assembly problem as an opti- 
mization problem in which the objective function is 
the likelihood of the assembly given the reads. 

Authors Note Since the publication of the first 
version of this manuscript, we have discovered that 
the same technique has been previously published by 
Varma, Ranade and Aluru[l]. Please credit that pub- 
lication for the convex optimization formulation of 
maximum likelihood assembly. 

Below, we list the main differences between this 
work and the work of Varma et al. We hope that 
some may be of additional interest. 

• The assembly graph: For the most part, we use 
an assembly graph very similar to string graph 
(which is used by Varma, et al.), with one im- 
portant difference; In the construction of our as- 
sembly graph, to collapse two edges, it is not 
sufficient that they are incident on a vertex with 
in-degree and out-degree of one. Using the op- 
timization formulation, we explain why each of 
the graph reduction operations used during con- 
struction of a string graph are (or are not) jus- 
tified. (Section gj) 

• Assembly length: We assume assembly length is 
known, and is a constant in the optimization. 
This is due to the fact that if L was a variable 
in (H])-©, for any solution with a particular ob- 
jective value, one can construct an infinite set of 
solutions with the same objective value by scal- 



ing the variables Xi 



and L. (Section [2]) 



• Calculating probability of a query sequence us- 
ing the optimal (fractional) solution: Given the 
assembly graph, and an optimal solution (which 
provides fractional values for the edges), we give 
a dynamic programming algorithm to calculate 
the probability of a query sequence that can span 
several edges of the graph. Note that there is no 
need to round the optimal solution for this algo- 
rithm. (Section [SJ) 

• Rounding the optimal solution, and constructing 
a contiguous assembly: We describe a random- 
ized rounding procedure that produces an Eulc- 
rian graph. This procedure will further simplify 
the solution, and possibly discard very low prob- 
ability edges and vertices. All Eulerian cycles of 
the rounded solution will have equal likelihood. 
(Section H) 

• Errors in the reads: We give a short recipe 
for (approximately) generalizing the convex op- 
timization formulation to a sequencing model 
which allows errors in the reads. (Section [51) 

• Using a generalized suffix tree as an assembly 
graph for exact reads: Assuming the reads are 
absolutely error-free, we observe that we can use 
a suffix tree (with suffix links) as the assembly 
graph. The suffix tree, in the worst-case has lin- 
ear size and can be constructed in linear time. 
Whereas the memory and computation required 
to build the string graph may be quadratic, be- 
cause the number of overlaps is quadratic in the 
worst case. (Appendix IA"!) 
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1 Introduction 

The likelihood of an assembly is proportional to 
the probability of observing the sequenced reads, if 
so many reads were generated from the assembled 
sequence according to the sequencing model[21 Q]. 
Therefore, given a set of reads R, the genome as- 
sembly problem^ is to find a superstringj A, which 
maximizes the probability of observing R: 



pt[r\a}=h(^) 



(1) 



appears m 



where rii is the number of times read i 
A, and L is the length of A. 

Consider the prefix graprjf] of the read sequences in 
R. This graph has the following properties: 

• It is directed. 

• Each read corresponds to a vertex. 

• The length of each edge (denoted by Uj) is the 
length of the shortest prefix of read i, such that 
the remainder of read i is a prefix of read j. If 
two or more reads have the same sequence, they 
are connected by a zero length path. (Not a zero 
length cycle.) 

This is a complete directed graph, and all vertices 
have self-loops. Even if two reads do not overlap at 
all, there is an edge between them labeled by the 
sequence of the first read. (These longer edges cor- 
respond to breaks in the assembly. We refer to them 
as break edges.) The edge lengths satisfy triangle in- 
equality. (However, note that the edge lengths are 
not a metric. In particular, in general Zy ^ Iji, and 



1 For the sake of simplicity we use the a very basic sequenc- 
ing model in much of this paper. In particular, we assume 
the read sampling is uniform, there are no sequencing errors, 
and the reads have equal lengths. These assumptions can be 
relaxed to some extent, without changing the nature of the 
problem. 

2 In practice an assembly may not include all of the 
reads. This can be handled by slightly adjusting the objec- 
tive function pp. 

3 The prefix graph [5] has been used for finding an approxi- 
mate shortest common superstring. It is similar to the string 
graph[3] without any reduction or simplification. 



always la ^ 0.) It can be shown that any assembly se- 
quence that maximizes ([1} corresponds to some walk 
in this grapt@. We want to find such an optimal walk. 



2 A tractable optimization for- 
mulation 

We first express finding an optimal walk in terms of 
an integer programming problem. We have to assume 
that the length of thegenome being assembled (de- 
noted by L) is knowro Let x^ denote the number 
of times a walk traverses the edge between vertices i 
and j. We obtain the following integer optimization 
problem: 



maxn(|) 

ieR. 

Vi, m = ^ 

3 

L = k 

* 3 

Vk, ^ x kj = 51 Xlk 

3 i 

Xi j G {0,1,2,...} 



(2) 
(3) 
(4) 

(5) 
(6) 



It is essential to understand that this integer pro- 
gram does not perfectly model the assembly problem. 
In particular it does not enforce global contiguity. 
Therefore the optimal solution is an upper bound on 
the probability for the optimal assembly. Also note 
that the graph induced by a solution is Eulcrian (be- 
cause of |5]) ) . But such a graph may have more than 
one connected components. Therefore, any integer 
solution will correspond to one or more cycles. 

Solving an integer programming problem is not fea- 
sible for large instances. Fortunately, we can trans- 



4 The corresponding sequence for a walk in the prefix graph 
is obtained by concatenation of the labels on the edges, and 
the sequence of the last vertex. 

5 In general, the length of the genome can not be estimated 
using random fragments only. However, assuming most of the 
genome is non-repetitive, the length of the genome can be esti- 
mated. There are many exceptions to this assumption; notably 
polyploid organisms. 
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form this problem to a convex optimization problem. 
Let us assume all variables are non-negative real num- 
bers. In this case, ^ should become x^j > 0. Note 
that, in constraint since an optimal solution for 
any value of L can be scaled to an optimal solution for 
all values of L, we can assume L = 1 without loss of 
generality. Let Ui = ^ denote the new variables cor- 
responding to the graph vertices. This means that 
Hi = Pr[i|j4]. Furthermore we introduce new vari- 
ables Zi — logy;, and modify the objective function 
accordingly. 

The new optimization problem is: 



max zi 

i 

Vi,Zi < log yi 
V*, y% = ^2 Xl J 

3 

1 = Xij lij 

i 3 



Xij > o 



(7) 

(8) 
(9) 

(10) 

(11) 
(12) 



This is a convex optimization problem. The objec- 
tive function is linear, and the feasible solution set 
is convex. Because all constraints are linear, except 
([5]), which is convex, and the intersection of several 
convex sets (constraints) is also convex. 

Given the optimal solution for this problem (which 
will, in general, be fractional), we can proceed in two 
ways. We can either use the fractional solution di- 
rectly to answer standard queries on the assembly 
(section [S]) . Or, we can try to round this solution 
and generate a final sequence for the assembly (sec- 
tion [3]) • The rounding procedure must preserve the 
Eulcrian property: for all vertices, in-degree must be 
equal to out-degree. Note that the resulting Eule- 
rian graph may have many tours, all of which will 
have equal likelihood. Therefore any final solution 
(assembled sequence) is, by itself, only one of many 
possible solutions. 



3 Constructing an assembly 
from an optimal fractional so- 
lution 



We will use the same names for the variables of the 
convex program (I71)- ([T^|) and for the value of these 
variables in an optimal solution. In order to round 
the solution from the convex program, we first round 
the values corresponding to the vertices (i.e. yts, rep- 
resenting in-degrec = out-degrec) and then find the 
best set of edges to support them. So at this point we 
know how many times each vertex is to be visited in 
the assembly (we get yiS from the convex solution and 
round them up or down to [Lyi] or \Lyi\ indepen- 
dently at random, and call the rounded value rii, such 
that E[ni] = Lyi), but we don't know which edges to 
take (the convex solution only gives us fractional val- 
ues for edges, we want integral values). We want to 
select edges such that the total length of the assem- 
bly is minimized, while visiting each vertex exactly 
rii times (i.e. in-degree = out-degree = rii). This can 
be formulated as a minimum weight bipartite match- 
ing problem. The edge weights are the edge lengths 
as defined in prefix graph, and we have a degree for 
each vertex. 

We replicate vertex i, rii times. Then translate 
the directed graph to a bipartite graph as follows: 
Duplicate each vertex, put one in each of the two 
parts. Then for each directed edge put an edge from 
its source vertex in the first part to its destination 
vertex in the second part of the bipartite graph. 

This matching problem has an optimal (integral) 
solution that can be found in polynomial time. This 
translates to a (possibly disconnected) set of Eulerian 
graphs embedded in the prefix graph. Any Eulerian 
cycle (or set of cycles) has the same likelihood. 

We have no bound on the approximation factor of 
this algorithm. 
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4 Simplifying the graph 

The size of the prefix graph as described above is 
quadratic in the number of readfl An optimization 
problem of this size is not practical. Fortunately, 
we can reduce the size of the graph while ensuring 
that the optimal solution for the new graph is also 
an optimal solution for the original graph. We will 
first give some theoretical justifications for a set of 
operations that transform any optimal solution on 
the prefix graph such that it avoids certain edges. 
Next we will give a more practical recipe for efficiently 
constructing a simplified graph directly. 

4.1 Optimality-preserving transfor- 
mations 

Consider the value of an optimal solution. In the fol- 
lowing, we show that several kinds of transformations 
of the graph do not change the value of the optimal 
solution. Furthermore, since our transformations in- 
volve removing edges from the graph, any optimal so- 
lution in the "simplified" graph can be easily turned 
into an optimal solution for the original graph. 

• Remove "transitive" edges. If hj + Ijk < hk then 
any optimal solution which uses the edge ik can 
be transformed into a solution of greater than or 
equal likelihood which uses the two edges ij and 
jk instead. (Because, this transformation does 
not increase the total length of the assembly and 
does not decrease the product of rijS.) 

• Unfortunately transitive edge removal does not 
affect the long "break" edges (See Section[TJ that 
exist between every pair of vertices that do not 
have a significant overlap. These edges keep the 
graph nearly complete. Fortunately we can ar- 
gue that most of these edges can removed if a 
better local path exists. In particular for every 
vertex v which is part of a "compact" and "uni- 



6 As a special case, a different kind of graph of linear size can 
be built assuming the reads have no errors whatsoever. This 
is not useful in practice, but for some theoretical amusement 
see section lAl 



form" patrQ, we can remove all incoming and 
outgoing break edges. 

Once these edges are removed, we can collapse 
the paths that have no branches. (Or not!) The 
important point is to note that there is a subtle 
difference between this and the traditional path 
collapsing heuristic] 3]- We do not collapse non- 
uniform paths, even if there are no branches due 
to overlapping reads in them. For example, if the 
original genome contains two chromosomes with 
sequences X and XY, consolidating X and Y 
into a single edge is not allowed. (A single edge 
forces the coverage of X and Y in the assembly 
to be the same, whereas the observed coverage 
of X is twice that of Y.) 

• The remaining "break" edges can be replaced 
by adding a dummy vertex, and rerouting every 
break edge through this vertex. In particular, 
we add an edge from every vertex to this spe- 
cial vertex. These edges have lengths and labels 
equal to that of the reads. We also add zero 
length edges from the special vertex back to all 
vertices. Note that this special vertex does not 
correspond to any variable in the optimization 
problem, but the edges incident on this vertex 
do. 

4.2 Efficient construction of a simpli- 
fied graph 

This section looks very similar to section 14. 11 It 
serves a subtly different purpose though. Previously, 
we started with a complete graph (of size 0(n 2 ) for 
n reads), and transformed it while preserving (the 
value of) an optimal solution (which conceptually had 
been found on the original graph) . In this section we 
try to construct an already simplified graph directly 
from the reads, with all intermediate steps requiring 
no more than close to linear memory. The optimiza- 
tion problem is then solved on the resulting smaller 
graph. 

1 . Only find overlaps of certain significance. 

7 We have a more rigorous theoretical definition and proof 
which is omitted here. 
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2. Remove transitive edges. 

3. Compress paths without any branches (in or out) 
if all the edges on the path have (approximately) 
the same length. (But remember how many ver- 
tices were compressed into each edge.) 

4. Simulate "break" edges between every pair of 
vertices by adding a dummy vertex as described 
in section 14.11 This way, the number of addi- 
tional edges will be linear (instead of quadratic) 
in terms of the number of vertices of the graph. 

This graph, on average, requires approximately 
0(Ld) space to construct, where L is the length of 
the genome and d is depth of coverage, (i.e. the 
space requirement is linear in the size of the input 
reads.) Assuming the structure of the repeats in the 
genome is not very complex, the final graph has size 
proportional to L. 

5 Using an optimal fractional 
solution directly 

We do not need to build an assembly or even round 
the optimal fractional solution of the convex pro- 
gram Q- ([T2} to be able to answer some questions 
regarding the genome. One of the most important 
queries that we can answer using the fractional so- 
lution directly is finding the probability of any given 
sequence being observed in the genome. 

Given X, a fractional solution to (|7))- (fT2|) . and a 
query sequence s, the probability of observing s is 
the sum of probabilities of observing it over all walks 
(of length close to the length of s). That is: 

Pr [ S \X] = Pr i s H Pr [w\X] 

w 

This is not how we actually calculate the probabil- 
ity, but it is useful to separate the two contributing 
factors. The probability of a walk only depends on 
the fractional solution. (But not on the query se- 
quence or the sequencing model.) 

Let us first explain how to calculate the probabil- 
ity of a given walk. The key intuition is to think of 



the fractional solution as the values of a "flow" in 
the edges of the graph. Because of the Eulerian con- 
straint, we are guaranteed that the incoming flow and 
the outgoing flow are equal for every vertex. Then 
the probability of a walk is the amount of flow that 
exactly follows that particular path. 

For example, assume that we have arrived at vertex 
i. The probability of choosing the outgoing edge from 
i to j is: 

Pr[follow(i,j)] = J? ij (13) 
Therefore the probability of a walk that visits i\, 12, 

• ■ • , In IS. 

n-l 

Pr [start(i x )] J[ Pr [follow^, i k+1 )} 

k=l 

As mentioned above, we can not enumerate all 
walks to calculate the probability of a query sequence. 
In the following, we present a dynamic programming 
algorithm that calculates this probability directly. 
This algorithm assumes a sequencing model in which 
only substitution errors are allowecQ 

We define T[x,i,j,y] as the probability of observ- 
ing the prefix [1 . . . y] of s, and ending at position x 
on the edge from i to j . If x > 1, then 

T[x,i,j,y] =Pr [Subst(e ij -[a;] ) s[y])] x 
T[x - l,i,j,y- 1] 

For x — 1, the recurrence depends on the last value 
of all edges ending at vertex i: 

T[l,i,j,y] = Px [Subst(e ij -[l],s[ 2 /])]x 
Pr [follow(i, j)} x 

Y T [ l kt,k,i,y - 1] 
where Pr [follow(i, j)] is defined by (fl"3|) . 

8 We believe this algorithm can be extended to a sequencing 
model that allowed insertions and deletions as well as substitu- 
tions. However, this would require solving (or approximating) 
a big system of linear equations for each letter in the query 
sequence, and is considerably more complex. 
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At the beginning, T[x, 0] is initialized to Xij 
from the fractional solution. At the end, the proba- 
bility of observing the sequence s is given by 

^T[a;,i,j,length(s)] 

Note that we only need to care about the edges 
whose > 0. Furthermore we can use techniques 
in section @] to obtain a graph with fewer edges to 
begin with. Finally, since the recurrence values for 
y only depend on the values for y — 1 we only need 
to keep the last row of the dynamic programming 
table. This means that in the worst case, the memory 
requirements would be linear in the size of the input 
reads. 

6 Reads with errors 

The convex optimization (|Tl)- (fT2|) can be extended to 
the case of reads with errors, with some approxima- 
tions and assumptions. The construction and sim- 
plification of assembly graph is more complicated, 
e.g. For each pair of reads, in the case of error-free 
reads, we only cared about the longest overlap be- 
tween them. In the case of reads with errors, (theo- 
retically) we have to consider all overlaps. Therefore 
there may be multiple edges (of different lengths) be- 
tween a given pair of vertices. Furthermore, since the 
overlaps are not perfect, each edge needs another at- 
tribute in addition to its length: Pij t o the probability 
of observing sequence from read j in the overlap o, 
by sequencing from i. 

Note that Pij >0 is a constant (between and 1) in 
the optimization problem. In fact pij, a is strictly less 
than 1, even if the overlap is perfect. Also, assum- 
ing constant error rate, the shorter the length of the 
overlap, the higher py, . 

Finally, we need to adjust the equation © for j/j 
(the probability of read i, in the convex optimization 
formulation). Note that in the case of a sequencing 
model with errors, a read may not be a substring 
of the assembly sequence exactly, but the read may 
still be "observed" (with lower probability) due to 
sequencing errors. The yi taking into account the 



probabilities of the overlaps is 

A caveat of the new formulation is that we can 
not simply apply the simplification rules in sec- 
tion 3J However, with some approximating assump- 
tions, they can be modified to extend to the case of 
reads with errors. For example, the extended and 
more specific transitive reduction rule is: remove 
the edge from reads i to k, if hu > Uj + Ijk and 
Pik < PijPjk- 

The path compression rule can stay relatively the 
same if we allow for some approximation. Except 
that we have to keep track of the p values for the 
removed edges, in addition to their lengths and their 
number. 

Due to the transitive reduction rule being more 
strict than in the error-free case, we will probably 
not be able to simplify the graph as much as be- 
fore. (And therefore the convex optimization problem 
will be much larger.) However, one could relax the 
Pik < PijPjk condition to p ik (l — fx - da) ik < p lJ p j k 
where \i and a are the mean and standard deviation 
of the rate of sequencing errors. With such heuristic 
assumptions we should be able to simplify the graph 
nearly as much as in the error free case. 
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A Assembly graph for reads 
without errors 

Graph for exact reads can be built in linear space us- 
ing a suffix tree with suffix links. For each read with 
sequence s, add two strings to a generalized suffix 
tree: s$i, and s$ 2 - Each unique read sequence would 
then correspond to a particular internal node of the 
suffix tree. Consider all walks in this graph, that can 
use tree edges or suffix links. The length of each tree 
edge is the length of the sub-string corresponding to 
that edge, and the length of each suffix link is zero. 
We want to find the circular walk of given length in 
the tree that maximizes (a function of) the number 
of times each read node is visited. 
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